Bioinformatics: A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Sequencing and Raw Sequence Data Quality Control ◾ 9

applications such as de novo genome assembly, variant discovery, and epigenetics. Usually,

genomes have long repeated sequences ranging from hundreds to thousands of bases that

are hard to cover with the short reads produced by NGS. The foundation for TGS

emerged in 2003, when a DNA polymerase was used to obtain a sequence of 5 bp from

a single DNA molecule by using fluorescence microscopy [4]. Single-molecule sequencing

(SMS) then evolved to include (i) direct imaging of individual DNA molecules using

advanced microscopy techniques and (ii) nanopore sequencing technologies, in which a

single molecule of DNA is threaded through a nanopore and its bases are detected

as they pass through the nanopore. Although TGS provides long reads (from a few hundred

to thousands of base pairs), this may come at the expense of accuracy. Recently,

however, the accuracy of TGS has greatly improved. TGS provides long reads

that can enhance de novo assembly and enable direct detection of haplotypes and higher

consensus accuracy for better variant discovery. In general, two TGS technologies

are currently available: (i) Pacific Biosciences (PacBio) single-molecule real-time

(SMRT) sequencing and (ii) Oxford Nanopore Technologies (ONT) sequencing.
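
The point made above about long repeats can be illustrated with a small sketch. It is not part of any sequencing pipeline: the toy genome, the repeat, and the helper name find_all are hypothetical and chosen only for illustration, and exact string matching stands in for real read alignment. A short read that falls entirely inside a repeat matches the genome in more than one place, whereas a longer read that also covers unique flanking sequence matches in exactly one place.

```python
from typing import List


def find_all(genome: str, read: str) -> List[int]:
    """Return every 0-based start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Continue searching one base further so overlapping matches are found.
        start = genome.find(read, start + 1)
    return positions


# Toy genome (hypothetical): two copies of the same 12-base repeat,
# each surrounded by different unique flanking sequence.
REPEAT = "GATTACAGGCTA"
GENOME = "TTGACC" + REPEAT + "GGATCA" + REPEAT + "CCTAGA"

# A short read (8 bases) that lies entirely inside the repeat.
short_read = REPEAT[2:10]

# A longer read (18 bases) covering the first repeat copy plus
# 3 unique bases on each side.
long_read = GENOME[3:21]

short_hits = find_all(GENOME, short_read)
long_hits = find_all(GENOME, long_read)

print("short read placements:", short_hits)
print("long read placements:", long_hits)
```

In this toy case the short read can be placed at two positions (one per repeat copy), so its origin is ambiguous, while the long read has a single placement because it spans the repeat and reaches the unique flanks.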

1.2.3.1 PacBio Technology

Pacific Biosciences (PacBio) sequencing can provide long reads that range between

500 and 50,000 bp. PacBio sequencing has been improved since its debut in 2011.

The underlying technology of PacBio is SMRT sequencing, in which a

single DNA molecule is sequenced and the bases are called in real time, while the

sequencing is in progress [5]. The sequencing steps include fragmentation and ligation of

adaptors to the DNA template for library generation. Special loop adaptors are ligated to

the ends of the double-stranded DNA (dsDNA) fragments. The loop adaptors link

both strands, forming structures called linear DNA SMRTbells. The sequencing takes place

in nano wells on a flow cell. The nano wells, called zero-mode waveguides (ZMWs), are

built on silicon dioxide chips [6]. A cell contains thousands of ZMWs. A ZMW is around

70 nm in diameter and 100 nm in depth, and it allows laser light to come through the bottom

to excite the fluorescent dye. A DNA polymerase is attached to the bottom of the nano

well. When a single DNA fragment is added to the well, the DNA polymerase binds

to it. The polymerase has strand-displacement capability, which converts the linear DNA

SMRTbell into a circular structure called a circular DNA SMRTbell (Figure 1.5). Then, the

DNA polymerase continues adding nucleotides to form a complementary strand for both

forward and reverse strands. PacBio uses a sequencing-by-synthesis (SBS) approach. Four fluorescently

labeled nucleotides (dNTPs) are added to the reaction. The dNTPs are labeled by attaching

the fluorescent dyes to the phosphate chain of the nucleotides. Each time a fluorescently

labeled nucleotide is incorporated, its fluorescent dye is cleaved from the growing nucleic

acid chain before the next nucleotide is added. The fluorescence is excited by the light

coming through the bottom of the well and detected in real time. The real-time identification

of the incorporated labeled nucleotides allows base calling. This process of adding

nucleotides and fluorescence detection continues until the entire fragment is sequenced.

This pass can be repeated multiple times to generate more accurate reads through circular

consensus sequencing (CCS). Ten passes produce reads with 99.9% accuracy. These reads are